ABSTRACT

Most search engines index the textual content of documents in 
digital libraries. However, scholarly articles frequently report im- 
portant findings in figures for visual impact and the contents of 
these figures are not indexed. These contents are often invaluable 
to the researcher in various fields, for the purposes of direct com- 
parison with their own work. Therefore, searching for figures and 
extracting figure data are important problems. To the best of our 
knowledge, there exists no tool to automatically extract data from 
figures in digital documents. If we can extract data from these 
images automatically and store them in a database, an end-user 
can query and combine data from multiple digital documents si- 
multaneously and efficiently. We propose a framework based on 
image analysis and machine learning to extract information from 
2-D plot images and store them in a database. The proposed 
algorithm identifies a 2-D plot and extracts the axis labels, leg- 
end and the data points from the 2-D plot. We also segregate 
overlapping shapes that correspond to different data points. We 
demonstrate performance of individual algorithms, using a com- 
bination of generated and real-life images. 

Categories and Subject Descriptors 

Data Mining/Extraction [Information Systems Appli- 
cations] : 

General Terms 

Information Extraction, Machine Learning, Metadata 

1. INTRODUCTION 

A wide variety of quantitative information is summarized 
and visually presented using 2-D plots, including scientific 
results, business performance reports, time series, etc. The 
embedded information is invaluable in that once extracted, 
the data can be indexed and the end-user has the ability to 
query the data, and operate directly on the data. However, 
in order to extract information from figures without manual
intervention, we must identify 2-D plot figures, segment the
plots to extract the axes, the legend and the data sections,
extract the labels of the axes, separate the data symbols
from the text in the legend, identify data points and segregate
overlapping data points. Performing all of these tasks
automatically with high precision is a challenging problem,
and we believe that ours is the first attempt to achieve this
goal. This paper is devoted to a subset of the overall process,
specifically the identification of 2-D plots and the
disambiguation of overlapping data points. We perform content-based
image analysis to identify appropriate features that
distinguish a 2-D plot from other figure types. Li et al. [6]
have shown that the histogram distribution of the wavelet
coefficients can effectively be utilized as a global image
feature for picture and non-picture classification. We adapt
these methods by using additional features, including line
features determined after edge detection and the Hough
transform, and the text surrounding the figure, e.g., the figure
caption. Identifying data points in an image is a hard
problem, especially when multiple data points overlap.
Typically, a figure uses common symbols (triangle, square, circle,
etc.) to designate a series of data points in a two-dimensional
space. When data points overlap, the resulting irregular
shape does not exactly match any regularly shaped
data point. To extract data precisely from figures in digital
documents, one must segregate the overlapping shapes and
identify the shape and the center of mass of each overlapping
data point. We employ simulated annealing, a stochastic
optimization method, to segregate these shapes and find the
method to be fairly accurate.

2. RELATED WORK 

The image categorization portion of our work bears a
similarity to image understanding; however, we focus on deciding
whether a given image contains a 2-D plot. Li et al. [6]
developed wavelet-transform-based, context-sensitive algorithms to
perform texture-based analysis of an image, separating
camera-taken pictures from non-pictures. Building on this
framework, Lu et al. [8] developed an automatic image
categorization system for digital library documents, which
categorizes images within the non-picture class into multiple
classes, e.g., diagrams, 2-D figures, 3-D figures and
others. We find significant improvements in detecting 2-D
figures by substituting certain features used in [8]. The work
in [7] presents image-processing-based techniques to extract
the data represented by lines in 2-D plots. However, [7] does
not extract the data represented by data points and treats the
data point shapes as noise while processing the image. Our
work is complementary in that we address the question of
how to extract data represented by various shapes.

3. PRELIMINARY 

Our algorithm segments a 2-D figure into three regions: 1) the
X-axis region, containing the X-axis labels and numerical units,
i.e., the area below the horizontal axis in Fig. 1; 2) the Y-axis
region, containing the Y-axis labels and numerical units, i.e.,
the area to the left of the vertical axis in Fig. 1; and 3) the
plotting region, which contains legend text, data points, and
lines. A 2-D figure depicts a functional distribution of the
form $y_i = f_i(x)$ with conditions $w_i$, where the Y-axis and
X-axis labels contain the descriptions of the y and x data. The
legend with textual content provides the particulars of the
conditions $w_i$, and the values of these functions are
represented by the data points or the lines in the plot.




Figure 1: A sample 2-D plot displaying experimental
results reported in [9]. The areas of interest in
the diagram are the X-axis region, the Y-axis region and
the plotting region.



4. METHOD 

4.1 Overview 

The system uses a machine-learning based classifier to iden- 
tify which figures in the document are 2-D plots. An identi- 
fied image is then segmented into the previously mentioned 
three regions. The algorithm performs connected compo- 
nent analysis to label each connected component in the three 
regions so that its shape and position can be further ana- 
lyzed. Next, the candidate text components are identified 
based upon their mutual positioning and spacing informa- 
tion. This identification is based upon the intuition that the 
two characters appearing in the same string are very likely 
to be placed next to each other. Also, the spacing between 
them is roughly the same for any two characters appearing 
in any other string of text in the figure. In the next stage, 
we identify the data points in the plotting region. This is 
achieved by removing the lines from the region in a manner 
whereby only the data points remain; Fig. 2 depicts the 
entire process. 
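
As an illustration of the connected component step, the following is a minimal sketch and not the authors' implementation. It assumes the region is available as a binary grid (a list of rows of 0/1 values) and uses 8-connectivity; the function names `connected_components` and `bounding_box` are hypothetical.

```python
from collections import deque

def connected_components(grid):
    """Return the 8-connected foreground components of a binary grid.

    Each component is a list of (row, col) pixels, in scan order of discovery.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, usable for shape and position analysis."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return (min(ys), min(xs), max(ys), max(xs))

example = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
boxes = [bounding_box(p) for p in connected_components(example)]
```

The bounding boxes give the position and extent of each component, which is the kind of information the later text-grouping and data point stages would operate on.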

4.2 Identification of 2-D Plots 
Figure 2: Process flow of Information extraction
from 2-Dimensional Plot



Image segment features: Li et al. [6] have proposed an
image segmentation algorithm that divides an image into
small non-overlapping blocks. They use the wavelet
coefficients of each block as a localized feature to obtain global
information on the text, background and picture regions.
Lu et al. [8] have found these localized features to be very
effective in separating photo and non-photo images as well.
Since 2-D plots are a subset of non-photo images, we use
these features. Lu et al. have noted that the finer
aspects of colors and shades do not contribute heavily towards
identifying the "semantic type" of the figure. Therefore, in
extracting the image segment features, we converted each
image to grayscale (portable gray map or PGM) format.

Axes Features: 2-D figures range from curve-fitted plots 
to histograms and pie-charts. We are primarily interested in 
2-D plots that graph the variation of a variable with respect 
to another variable and the presence of co-ordinate axes is 
certainly a distinguishing feature of such plots. We apply the 
Hough transform [4] on the binarized image to obtain the
positional information of the longest straight lines, including
their mutual angles (e.g., the X and Y axes are orthogonal), and
use these as features.
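
To make the axes features concrete, here is a minimal sketch of a Hough-style line vote on a binary grid. It is an illustration under stated assumptions, not the feature extractor used in the experiments: the coarse 5-degree angle step, the greedy peak selection, and the names `hough_lines` and `angle_between` are all hypothetical choices. Lines are parameterized as rho = x cos(theta) + y sin(theta).

```python
import math
from collections import Counter

def angle_between(theta_a, theta_b):
    """Smallest angle in degrees between two line orientations (0 to 90)."""
    d = abs(theta_a - theta_b) % 180
    return min(d, 180 - d)

def hough_lines(grid, theta_step_deg=5, top_k=2, min_sep_deg=10):
    """Vote in (theta, rho) space and return up to top_k dominant lines.

    Returns (theta_deg, rho, votes) tuples. Peaks closer than min_sep_deg
    in orientation are suppressed, so parallel lines are not both reported;
    this is adequate for a pair of coordinate axes only.
    """
    accumulator = Counter()
    thetas = range(0, 180, theta_step_deg)
    for y, row in enumerate(grid):
        for x, value in enumerate(row):
            if not value:
                continue
            for theta in thetas:
                t = math.radians(theta)
                rho = round(x * math.cos(t) + y * math.sin(t))
                accumulator[(theta, rho)] += 1
    # Strongest cells first; ties are broken by (theta, rho) for determinism.
    ranked = sorted(accumulator.items(), key=lambda kv: (-kv[1], kv[0]))
    picked = []
    for (theta, rho), votes in ranked:
        if all(angle_between(theta, other[0]) >= min_sep_deg for other in picked):
            picked.append((theta, rho, votes))
        if len(picked) == top_k:
            break
    return picked

# Synthetic 10 x 10 image: a vertical axis at x = 1 and a horizontal axis at y = 8.
axes_image = [[1 if (x == 1 or y == 8) else 0 for x in range(10)] for y in range(10)]
lines = hough_lines(axes_image)
mutual_angle = angle_between(lines[0][0], lines[1][0])
```

The positions (rho), lengths (votes) and mutual angle of the strongest lines are the quantities that could be passed to the classifier as features.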

Text Features: From our observations, we found that
authors tend to employ certain terms in writing captions for
2-D plots that are used less frequently in captions for other
types of figures. For instance, recurring words include
distribution, slope, axes, plot, range, etc. We use these
words to form Boolean features while training our classifier.
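
A minimal sketch of such Boolean caption features follows. The term list contains only the five example words given above (the full list used in the experiments is not given here), and the exact-word matching and the name `caption_features` are assumptions made for illustration.

```python
import re

# Only the example terms named in the text; the complete vocabulary is an unknown.
CAPTION_TERMS = ("distribution", "slope", "axes", "plot", "range")

def caption_features(caption, terms=CAPTION_TERMS):
    """Return one Boolean per term: True if the term occurs as a whole word."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [term in words for term in terms]

features = caption_features("Figure 3: Plot of the slope versus temperature range")
```

Whole-word matching is the simplest choice; it would miss inflected forms such as "plots", so stemming could be added if that matters for a given corpus.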

Support Vector Machines (SVMs) [1] are increasingly used in
both 2-class and multi-class classification for their
robustness and computational efficiency when compared to other
machine learning techniques. We train our classifier on
the aforementioned features using an SVM, and found
that a linear kernel with the C parameter set to 1.0
was best suited to our purposes.

4.3 Data Point Disambiguation 

Overlapping data points occur frequently in 2-D plots and 
identifying each individual data point and its coordinates is a 
difficult task. We apply simulated annealing (SA) in order to 
resolve individual data points within a region of overlap. SA
is a stochastic method, based on the Metropolis algorithm,
often used in non-convex optimization problems. It bears a
close similarity to annealing (i.e., slow cooling) in
metallurgical processes. By analogy to its physical counterpart, the
optimal configuration (lowest energy $E_{min}$) is approached
while the temperature $T$ is lowered. In accordance with the
Metropolis algorithm, higher-energy configurations
$E_f > E_i$ are occasionally accepted with probability
$e^{-(E_f - E_i)/T}$.
The specific details of the algorithm are presented below.
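
The Metropolis acceptance rule can be sketched in a few lines. This is a generic illustration of the rule, assuming a strictly positive temperature; the function names are hypothetical.

```python
import math
import random

def acceptance_probability(e_i, e_f, temperature):
    """Probability of moving from energy e_i to e_f at the given temperature (> 0)."""
    if e_f <= e_i:
        return 1.0  # downhill (or equal) moves are always accepted
    return math.exp(-(e_f - e_i) / temperature)

def accept_move(e_i, e_f, temperature, rng=random):
    """Draw a uniform random number and accept the move with the Metropolis probability."""
    return rng.random() < acceptance_probability(e_i, e_f, temperature)

p_hot = acceptance_probability(10, 12, 2.0)   # uphill move at a high temperature
p_cold = acceptance_probability(10, 12, 0.5)  # the same move at a low temperature
```

The same uphill move becomes less likely as the temperature drops, which is what lets the search escape local minima early on and settle later.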

We generate an 'initial configuration' image, which consists
of a large number of randomly selected candidate shapes
at random positions. Candidates are previously identified
shapes extracted from the 2-D plot using standard shape
detection methods [10]. The target image consists of
overlapping data points, extracted from within the plotting
region, which have failed to be classified as a particular shape.
Hence, we consider two matrices with binary (Boolean)
values: the generated image and the original overlapping data
point image. A Gramian matrix is constructed from the
difference between these two matrices, the trace of which is
used as a cost function and is minimized iteratively as
follows. To begin with, the coordinates of the candidate shapes
are given random fluctuations within the image boundary,
which is determined by the size of the target image. In
addition, point types are swapped, much like optimization within
combinatorial problems such as the traveling salesman
problem (TSP). Finally, the Euclidean
distance between the centroids of identical shapes is used as
a measure for removal of identical types which overlap. In
this manner, the number, variety and coordinates of
individual data points are ascertained. Carnevali et al. [2]
applied simulated annealing to construct an image from known
sets of shapes in the presence of noise. However, to the best
of our knowledge, the application of simulated annealing to
disambiguate overlapping shapes is a novel contribution.

Algorithm Data Point Disambiguation
Input: N binarized shapes, shape[1..N]; binarized pixel region B of
       overlapping points, with height h and width w
Output: Coordinates and numbers of independent data points

for each point-type shape[k]
    (* Determine bounds m, n for individual data point centroids
       from target image size *)
    bound[k][m, n] ← [h − height(shape[k]), w − width(shape[k])]
    (* Initial centroid for point-type shape[k] *)
    centroid_i[k][i, j] ← rand * bound[k]
    weight[k] ← 1  (* All initial weights = 1 *)
E_i ← T ← Cost(B, shape, centroid_i, weight)  (* Initial energy and temperature *)
repeat
    (* Random fluctuation applied to the k-th coordinates *)
    centroid_f[k][i, j] ← centroid_i[k][i, j] + round(rand * 2 − 1)
    E_f ← Cost(B, shape, centroid_f, weight)  (* Update cost after move *)
    if E_i > E_f
        then centroid_i ← centroid_f  (* Accept move *)
        else  (* Accept with probability exp[−(E_f − E_i)/T] *)
            if rand < exp[−(E_f − E_i)/T]
                then centroid_i ← centroid_f  (* Accept move *)
until E_f < ε
(* Every α steps, remove one of two identical overlapping points *)
if distance(centroid_i[k][i, j], centroid_i[l][i, j]) is below a threshold
    then weight[k] ← 0
(* Every β steps, reduce temperature *)
T ← T * (1 − ε)
(* Every γ steps, swap two point types *)
tmp ← centroid_i[k][i, j]
centroid_i[k][i, j] ← centroid_i[l][i, j]
centroid_i[l][i, j] ← tmp
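
The core of the disambiguation loop can be sketched as runnable code. The following is a simplified illustration, not the implementation evaluated in this paper: it keeps a fixed set of candidate shapes, applies only the random position fluctuation and the Metropolis test, cools the temperature at every step, and omits the removal and swap steps. The shape masks, the cooling rate and all function names are assumptions.

```python
import math
import random

def render(shapes, positions, height, width):
    """Draw each binary shape mask at its (top, left) offset using logical OR."""
    canvas = [[0] * width for _ in range(height)]
    for shape, (top, left) in zip(shapes, positions):
        for dy, row in enumerate(shape):
            for dx, value in enumerate(row):
                if value:
                    canvas[top + dy][left + dx] = 1
    return canvas

def mismatch(target, canvas):
    """Count differing pixels; for binary images this equals trace((B-C)'(B-C))."""
    return sum(t != c for t_row, c_row in zip(target, canvas)
               for t, c in zip(t_row, c_row))

def anneal(target, shapes, steps=4000, cooling=0.002, seed=0):
    """Search for shape offsets that reproduce the target; returns (best_cost, offsets)."""
    rng = random.Random(seed)
    height, width = len(target), len(target[0])
    # Largest admissible offset for each shape so that it stays inside the image.
    bounds = [(height - len(s), width - len(s[0])) for s in shapes]
    positions = [(rng.randint(0, bh), rng.randint(0, bw)) for bh, bw in bounds]
    energy = mismatch(target, render(shapes, positions, height, width))
    temperature = float(max(energy, 1))  # initial temperature set from initial energy
    best_energy, best_positions = energy, list(positions)
    for _ in range(steps):
        if best_energy == 0:
            break
        k = rng.randrange(len(shapes))
        top, left = positions[k]
        trial = list(positions)
        # Random fluctuation of -1, 0 or +1 pixel per coordinate, clamped to bounds.
        trial[k] = (min(max(top + rng.choice((-1, 0, 1)), 0), bounds[k][0]),
                    min(max(left + rng.choice((-1, 0, 1)), 0), bounds[k][1]))
        trial_energy = mismatch(target, render(shapes, trial, height, width))
        delta = trial_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            positions, energy = trial, trial_energy
            if energy < best_energy:
                best_energy, best_positions = energy, list(positions)
        temperature = max(temperature * (1.0 - cooling), 1e-6)
    return best_energy, best_positions

SQUARE = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
DIAMOND = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
shapes = [SQUARE, DIAMOND]
target = render(shapes, [(2, 2), (3, 4)], 10, 10)  # two overlapping points
best_energy, best_positions = anneal(target, shapes)
```

Tracking the best configuration seen so far is a convenience added in this sketch; it guarantees that the returned cost never exceeds that of any visited state, but it does not guarantee that the search reaches zero cost.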



Algorithm Cost Calculation
Input: B, shape[k], centroid_i, weight
Output: cost

C ← zeros(size(B))  (* Create empty matrix C with the dimensions of B *)
p ← length(weight)
for k ← 1 to p
    (* Logical OR between the range of indices in matrix C and
       the candidate point of size X, Y *)
    do C[centroid_i[k][i] : X, centroid_i[k][j] : Y] |= (shape[k] * weight[k])
(* Trace of the Gramian: transpose of (B − C) times (B − C) *)
return Trace[(B − C)' * (B − C)]
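
The trace of the Gramian of a difference matrix D = B − C is the sum of the squared entries of D (the squared Frobenius norm). For binary matrices each entry of D is −1, 0 or 1, so the cost is simply the number of pixels at which the two images differ. The short check below illustrates this identity with plain lists; the helper names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

def gram_cost(b, c):
    """trace((B - C)' (B - C)), the cost returned by the pseudocode."""
    d = [[bv - cv for bv, cv in zip(b_row, c_row)] for b_row, c_row in zip(b, c)]
    return trace(matmul(transpose(d), d))

def mismatch_count(b, c):
    """Number of positions at which two binary matrices differ."""
    return sum(bv != cv for b_row, c_row in zip(b, c) for bv, cv in zip(b_row, c_row))

B = [[1, 1, 0],
     [0, 1, 0]]
C = [[1, 0, 0],
     [0, 1, 1]]
cost = gram_cost(B, C)
```

In practice the mismatch count avoids forming the full matrix product, which is why a direct pixel comparison is a reasonable substitute when the inputs are binary.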

5. EXPERIMENTS 

In this section, we report the results obtained by evaluating
the new features for 2-D plot identification and the data point
disambiguation algorithm. The data set that we used for
our experiments consists of randomly selected publications crawled
from the web site of the Royal Society of Chemistry (www.rsc.org)
and randomly selected computer science publications from
the CiteSeer digital library for scientific publications.

5.1 2-D figure Classification 

For our classification experiments, we extracted the images
from the aforementioned documents and had them manually
tagged by two volunteers as 2-D or non-2-D. Our set
consists of 2494 images, of which 734 images are 2-D
plots. As mentioned previously, we train a linear SVM (with
C = 1.0) on this dataset.



Features    3-fold CV accuracy (%)
Only IS     85.24
Only CT     78.3
IS + CA     85.85
CT + CA     80.67
IS + CT     85.85
All         88.25

Table 1: Cross-validation accuracies



Class      Non 2-D    2-D
Non 2-D    1393       67
2-D        82         452

Table 2: Confusion matrix (train set)



Class      Non 2-D    2-D
Non 2-D    273        27
2-D        66         134

Table 3: Confusion matrix (sample test set)
5.1.1 Feature extraction 

Table 1 shows the 3-fold cross-validation accuracies with dif- 
ferent combinations of features. We use the following abbre- 
viations: IS for image segmentation, CT for caption text, CA 
for the coordinate axes. The confusion matrix over a sam- 
ple test set is shown in Table 3. For comparison purposes, 
we have also shown the confusion matrix over the training 
set in Table 2. The libSVM software was used for support 
vector classification [3]. 
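
Summary metrics can be derived from Tables 2 and 3 as sketched below. The orientation of the matrices is not stated; the sketch assumes that rows are the actual class and columns the predicted class, an assumption supported by the fact that the two 2-D rows sum to the 734 tagged 2-D plots (534 + 200). Accuracy does not depend on this assumption, whereas precision and recall do. The function name `summarize` is hypothetical.

```python
def summarize(matrix, positive=1):
    """Accuracy, precision and recall from a confusion matrix.

    Assumes matrix[actual][predicted]; `positive` is the index of the 2-D class.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    true_pos = matrix[positive][positive]
    actual_pos = sum(matrix[positive])
    predicted_pos = sum(row[positive] for row in matrix)
    return {
        "accuracy": correct / total,
        "precision": true_pos / predicted_pos,
        "recall": true_pos / actual_pos,
    }

TRAIN = [[1393, 67], [82, 452]]  # Table 2, class order: Non 2-D, 2-D
TEST = [[273, 27], [66, 134]]    # Table 3, same class order

train_metrics = summarize(TRAIN)
test_metrics = summarize(TEST)
```

Under this reading, the sample test set gives an accuracy of 407/500 = 81.4%, with 134 of the 200 actual 2-D plots recovered.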

5.2 Data Point Disambiguation 

For the purposes of our experiment, images of size 90 x 90
containing overlapping points were generated randomly using two
types, a diamond (A) and a triangle (B). Fig. 3 gives typical
examples of pixel regions containing overlapping data points
and the corresponding machine-learnt versions; Table 4
details the experimental parameters and results corresponding
to Fig. 3.



Figure 3: Examples of overlapping data points (left) 
and machine learnt versions (right) 



Iterations  Temp. const.  Type  Offset (orig.)  Offset (calc.)
10k         0.4           A     (11,39)         (11,40)
                                (35,19)         (34,20)
                                (19,4)          (20,3)
                          B     (21,35)         (22,35)
                                (10,18)         (10,17)
10k         0.3           A     (29,24)         (29,23)
                                (22,9)          (21,9)
                                (23,37)         (21,39)
                          B     (2,39)          (2,39)
                                (18,17)         (18,16)

Table 4: Example parameters for simulated annealing
applied to the data point disambiguation problem.
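
The localization error implied by Table 4 can be quantified as the Euclidean distance between each original and calculated offset. The short computation below is an illustration added for this purpose, assuming the offsets are pixel coordinates; the variable names are hypothetical.

```python
import math

# (original offset, calculated offset) pairs copied from Table 4, in row order.
OFFSETS = [
    ((11, 39), (11, 40)), ((35, 19), (34, 20)), ((19, 4), (20, 3)),
    ((21, 35), (22, 35)), ((10, 18), (10, 17)),
    ((29, 24), (29, 23)), ((22, 9), (21, 9)), ((23, 37), (21, 39)),
    ((2, 39), (2, 39)), ((18, 17), (18, 16)),
]

# Euclidean distance between the true and the recovered position of each point.
errors = [math.hypot(ox - cx, oy - cy) for (ox, oy), (cx, cy) in OFFSETS]
mean_error = sum(errors) / len(errors)
max_error = max(errors)
```

For these ten points the mean displacement is about 1.17 pixels and the largest is about 2.83 pixels, so every recovered point lies within three pixels of its original position.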

Table 5 gives the overall results of these experiments using
an annealing constant of 0.4 and 10k iterations. As the
annealing schedule is slowed and the number of iterations is
increased, the recall approaches 100%. A slower annealing
schedule than that used here and more iterations are required
as the pixel region and the number of possible different data
points increase. However, the results are promising in that
data that would traditionally be considered lost are recovered
with fairly high accuracy.



Shape      Total    # Correct    % Recall
Diamond    72       64           88.9
Triangle   78       71           91.0

Table 5: Experimental Results for Data-Point Disambiguation

6. CONCLUSIONS AND FURTHER WORK

We have outlined a system that can identify 2-D plots in
digital documents and extract data from the identified
plots. Overlapping data points present a major challenge
in reconstructing data series from within the plotting
region once lines are filtered from 2-D plots. We present an
unsupervised machine-learning algorithm to segregate
overlapping data points and identify their exact shape and
location. The work presented here is currently being integrated
into the overall figure extraction system. In addition,
attention is being given to improving the quality of extracted
textual information, to assist in indexing of figures.



